3635: Fixes a memory leak issue in the Avro C library under abnormal data scenarios.#31
3635: Fixes a memory leak issue in the Avro C library under abnormal data scenarios.#31martin-augment wants to merge 1 commit intomainfrom
Conversation
Summary of ChangesHello @martin-augment, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical memory leak vulnerability within the Avro C library, specifically targeting scenarios involving malformed or abnormal input data. By introducing a mechanism to determine the maximum readable buffer size and a new safe read macro that ensures proper memory deallocation upon read failures, the changes significantly enhance the library's robustness and prevent resource exhaustion when processing potentially malicious or corrupted data streams. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
WalkthroughThe changes add a new public API function ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Pull Request Review: Memory Leak Fix in Avro C LibrarySummaryThis PR addresses a memory leak issue in the Avro C library when handling abnormal/malformed data. The fix adds bounds checking and safe cleanup mechanisms for string and bytes reading operations. Code Quality & Best Practices✅ Strengths
|
There was a problem hiding this comment.
Code Review
This pull request aims to address a memory leak issue in the Avro C library by introducing robust error handling and memory management, specifically through the new avro_max_read function and AVRO_SAFE_READ macro. However, the current implementation has critical security and functional issues. The read_bytes function is missing OOM protection and is vulnerable to a heap buffer overflow via integer overflow on 32-bit systems. Furthermore, the avro_max_read implementation for file IO is broken, which will cause the library to fail when reading strings longer than 4KB from files, leading to a significant regression and denial of service.
| AVRO_READ(reader, *bytes, *len); | ||
|
|
||
| (*bytes)[*len] = '\0'; | ||
| AVRO_SAFE_READ(reader, *bytes, *len, *len+1); |
There was a problem hiding this comment.
The read_bytes function is missing the avro_max_read check that was added to read_string. This allows an attacker to trigger large memory allocations by providing a large length in the binary data, leading to Denial of Service (OOM). Additionally, the expression *len + 1 used in the allocation (line 134) and the AVRO_SAFE_READ macro (line 141) is susceptible to integer overflow. On 32-bit systems, if *len is INT64_MAX, *len + 1 wraps to 0, leading to a small allocation followed by an out-of-bounds write at line 140.
There was a problem hiding this comment.
value:useful; category:bug; feedback:The Gemini AI reviewer is correct! The addition of the extra null byte is problematic! If it is added before the safe read then it may lead to integer overflow when len=i64::MAX or the safe read fails then the appended \0 should be removed too. Adding it after the safe is safer because it does not have these issues.
| return mem_reader->len - mem_reader->read; | ||
| } else if (is_file_io(reader)) { | ||
| struct _avro_reader_file_t *file_reader = avro_reader_to_file(reader); | ||
| return bytes_available(file_reader); |
There was a problem hiding this comment.
The implementation of avro_max_read for file-based readers is incorrect. It returns the number of bytes currently in the internal buffer (reader->end - reader->cur), which is at most 4096 bytes. This causes read_string to incorrectly reject any string longer than 4KB with an ERANGE error, even if the file contains the full string. For file readers where the total size is not readily available in the buffer, this should return -1 to bypass the length check and avoid a denial of service for legitimate data.
return -1;| #define AVRO_SAFE_READ(reader, buf, len, mem_size) \ | ||
| { int rval = avro_read( reader, buf, len ); if(rval) { if(buf) avro_free(buf, mem_size); buf = NULL; return rval; } } |
There was a problem hiding this comment.
The AVRO_SAFE_READ macro introduces complex logic, including conditional memory deallocation and an early return. While effective for safety, macros with control flow can sometimes lead to unexpected behavior, make debugging more challenging, and reduce readability compared to a static inline function. A function would provide better type checking and scope management, reducing the risk of misuse.
Consider encapsulating this logic in a static inline function for improved maintainability and robustness, if the project's coding standards allow for it.
🤖 Augment PR SummarySummary: Hardens Avro C binary decoding to avoid leaks and improve safety under abnormal/corrupt input scenarios.
🤖 Was this summary useful? React with 👍 or 👎 |
| return mem_reader->len - mem_reader->read; | ||
| } else if (is_file_io(reader)) { | ||
| struct _avro_reader_file_t *file_reader = avro_reader_to_file(reader); | ||
| return bytes_available(file_reader); |
There was a problem hiding this comment.
In avro_max_read(), the file-IO branch returns bytes_available(file_reader), which only reflects the current internal buffer (and can be 0/undefined before the first fill). Callers like read_string() can then incorrectly reject valid reads when the requested size exceeds the currently buffered bytes.
Other Locations
lang/c/src/encoding_binary.c:194
🤖 Was this useful? React with 👍 or 👎
There was a problem hiding this comment.
value:useful; category:bug; feedback:The Augment AI reviewer is correct! The new functionality is not the same for memory based reads and file-based ones. For inmemory reads it returns the complete length of the bytes which could be read, while for files it returns only the free bytes in the buffer, not the bytes remaining to be read from the file itself.
| return EINVAL; | ||
| } | ||
| // max := r.tail - r.head + 1; if max >= 0 && size > max | ||
| max_available = avro_max_read(reader); |
There was a problem hiding this comment.
While adding the max_available guard, consider also guarding against overflow in the str_len + 1 size computation/allocation. With a very large str_len from corrupted input, signed overflow or a wrap when converting to size_t could lead to an undersized allocation and an out-of-bounds read/write.
🤖 Was this useful? React with 👍 or 👎
There was a problem hiding this comment.
value:useful; category:bug; feedback:The Augment AI reviewer is correct! The addition of the extra null byte is problematic! If it is added before the safe read then it may lead to integer overflow when len=i64::MAX or the safe read fails then the appended \0 should be removed too. Adding it after the safe is safer because it does not have these issues.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@lang/c/src/encoding_binary.c`:
- Around line 193-199: The comment above the max_available check is inaccurate
and the error message wrongly prefixes "mem io:" even though the check is used
for all reader types; update the comment to reflect the actual computation used
by avro_max_read (e.g., "max := len - read for memory / end - cur for file"
without "+1") and change the avro_set_error call in the block using
max_available, str_len and avro_max_read to use a generic prefix such as "io:"
or "reader:" (remove "mem io:") so the message correctly reflects both memory
and file readers.
In `@lang/c/src/io.c`:
- Around line 277-288: avro_max_read currently returns only buffered bytes for
file readers (via bytes_available), causing read_string bounds checks to fail
for valid reads; change the file branch in avro_max_read (is_file_io /
avro_reader_to_file) to compute total remaining bytes in the underlying file
instead of just the buffer by using the file's size and current absolute read
position (e.g., use fseeko/ftello or equivalent on file_reader->file to get
file_size and current_offset and return file_size - current_offset +
bytes_available(file_reader)), so callers like read_string in encoding_binary.c
get the true remaining byte count.
🧹 Nitpick comments (3)
lang/c/src/avro/io.h (1)
55-55: Consider adding documentation for the new public API.The new
avro_max_readfunction is a public API addition. Adding a brief comment describing its behavior (returns available bytes for memory readers, buffered bytes for file readers, or -1 for unknown types) would improve maintainability and help consumers understand the semantics.📝 Suggested documentation
+/* + * Returns the number of bytes available to read from the reader. + * For memory readers, returns remaining bytes (len - read). + * For file readers, returns buffered bytes available. + * Returns -1 for unknown reader types. + */ int64_t avro_max_read(avro_reader_t reader);lang/c/src/encoding_binary.c (2)
139-142: AVRO_SAFE_READ correctly prevents memory leak on read failure.The change properly frees the allocated buffer if
avro_readfails, fixing the memory leak. Note thatread_bytesdoes not have the sameavro_max_readpre-check thatread_stringhas—consider adding it for consistency to prevent large allocations on invalid input.♻️ Optional: Add bounds check for consistency with read_string
static int read_bytes(avro_reader_t reader, char **bytes, int64_t * len) { int rval; + int64_t max_available; check_prefix(rval, read_long(reader, len), "Cannot read bytes length: "); if (*len < 0) { avro_set_error("Invalid bytes length: %" PRId64, *len); return EINVAL; } + max_available = avro_max_read(reader); + if (max_available >= 0 && *len > max_available) { + avro_set_error("Bytes length %" PRId64 " exceeds available buffer %" PRId64, + *len, max_available); + return ERANGE; + } *bytes = (char *) avro_malloc(*len + 1);
193-199: Inconsistent indentation style.Lines 193-199 use spaces for indentation while the rest of the file uses tabs. This should be aligned with the file's existing style.
| // max := r.tail - r.head + 1; if max >= 0 && size > max | ||
| max_available = avro_max_read(reader); | ||
| if (max_available >= 0 && str_len > max_available) { | ||
| avro_set_error("mem io: String length %" PRId64 " is greater than available buffer size %" PRId64, | ||
| str_len, max_available); | ||
| return ERANGE; | ||
| } |
There was a problem hiding this comment.
Fix inaccurate comment and misleading error message.
- Line 193: The comment formula
max := r.tail - r.head + 1doesn't match the implementation (len - readfor memory,end - curfor file—no+1). - Line 196: The error message prefix
"mem io:"is misleading since this check applies to all reader types (memory and file).
📝 Suggested fix
- // max := r.tail - r.head + 1; if max >= 0 && size > max
+ // Bounds check: reject if string length exceeds available bytes
max_available = avro_max_read(reader);
if (max_available >= 0 && str_len > max_available) {
- avro_set_error("mem io: String length %" PRId64 " is greater than available buffer size %" PRId64,
+ avro_set_error("String length %" PRId64 " exceeds available buffer %" PRId64,
str_len, max_available);
return ERANGE;
}🤖 Prompt for AI Agents
In `@lang/c/src/encoding_binary.c` around lines 193 - 199, The comment above the
max_available check is inaccurate and the error message wrongly prefixes "mem
io:" even though the check is used for all reader types; update the comment to
reflect the actual computation used by avro_max_read (e.g., "max := len - read
for memory / end - cur for file" without "+1") and change the avro_set_error
call in the block using max_available, str_len and avro_max_read to use a
generic prefix such as "io:" or "reader:" (remove "mem io:") so the message
correctly reflects both memory and file readers.
|
|
||
| int64_t avro_max_read(avro_reader_t reader) | ||
| { | ||
| if (is_memory_io(reader)) { | ||
| struct _avro_reader_memory_t *mem_reader = avro_reader_to_memory(reader); | ||
| return mem_reader->len - mem_reader->read; | ||
| } else if (is_file_io(reader)) { | ||
| struct _avro_reader_file_t *file_reader = avro_reader_to_file(reader); | ||
| return bytes_available(file_reader); | ||
| } | ||
| return -1; | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Search for all usages of avro_max_read
rg -n "avro_max_read" --type=cRepository: martin-augment/avro
Length of output: 268
🏁 Script executed:
# Check the implementation of bytes_available
rg -n -B5 -A10 "bytes_available" lang/c/src/io.cRepository: martin-augment/avro
Length of output: 2994
🏁 Script executed:
# Look at read_string implementation in encoding_binary.c
rg -n -B3 -A15 "read_string" lang/c/src/encoding_binary.cRepository: martin-augment/avro
Length of output: 1265
🏁 Script executed:
# Check file reader structure and implementation
rg -n "_avro_reader_file_t" --type=cRepository: martin-augment/avro
Length of output: 902
🏁 Script executed:
# Check file reader initialization and buffer state
rg -n -B2 -A8 "avro_reader_new_file" lang/c/src/io.cRepository: martin-augment/avro
Length of output: 45
🏁 Script executed:
# Verify the file reader buffer initialization and typical usage patterns
rg -n -B5 -A10 "reader->buffer" lang/c/src/io.c | head -60Repository: martin-augment/avro
Length of output: 1627
File reader returns only buffered bytes, not total remaining bytes.
For file-based readers, bytes_available(file_reader) returns only the bytes currently in the internal buffer (reader->end - reader->cur), not the remaining bytes in the file. This creates asymmetric behavior:
- Memory readers: returns total remaining bytes (
mem_reader->len - mem_reader->read) - File readers: returns only buffered bytes (potentially zero if buffer is empty)
When read_string in encoding_binary.c (line 194-197) uses this for bounds checking, it incorrectly rejects valid reads from file-based readers when the requested string length exceeds the current buffer size, even though avro_read_file has logic to refill the buffer from the file and fulfill such reads.
Note: The error message in read_string says "mem io", indicating the check was intended for memory-based IO only, but it applies equally to file-based readers.
🤖 Prompt for AI Agents
In `@lang/c/src/io.c` around lines 277 - 288, avro_max_read currently returns only
buffered bytes for file readers (via bytes_available), causing read_string
bounds checks to fail for valid reads; change the file branch in avro_max_read
(is_file_io / avro_reader_to_file) to compute total remaining bytes in the
underlying file instead of just the buffer by using the file's size and current
absolute read position (e.g., use fseeko/ftello or equivalent on
file_reader->file to get file_size and current_offset and return file_size -
current_offset + bytes_available(file_reader)), so callers like read_string in
encoding_binary.c get the true remaining byte count.
There was a problem hiding this comment.
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! The new functionality is not the same for memory based reads and file-based ones. For inmemory reads it returns the complete length of the bytes which could be read, while for files it returns only the free bytes in the buffer, not the bytes remaining to be read from the file itself.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| return mem_reader->len - mem_reader->read; | ||
| } else if (is_file_io(reader)) { | ||
| struct _avro_reader_file_t *file_reader = avro_reader_to_file(reader); | ||
| return bytes_available(file_reader); |
There was a problem hiding this comment.
File reader incorrectly rejects valid string reads
High Severity
The avro_max_read function for file I/O returns bytes_available(file_reader), which only reflects bytes currently buffered (up to 4096), not total remaining file data. When read_string uses this to validate string length, it incorrectly rejects valid strings from files when str_len exceeds the current buffer contents, even though avro_read_file can read more by refilling from the file. For file readers, the function should return -1 to skip the validation check.
Additional Locations (1)
There was a problem hiding this comment.
value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! The new functionality is not the same for memory based reads and file-based ones. For inmemory reads it returns the complete length of the bytes which could be read, while for files it returns only the free bytes in the buffer, not the bytes remaining to be read from the file itself.
value:useful; category:bug; feedback:The Claude AI reviewer is correct! The addition of the extra null byte is problematic! If it is added before the safe read then it may lead to integer overflow when len=i64::MAX or the safe read fails then the appended \0 should be removed too. Adding it after the safe is safer because it does not have these issues. |
value:useful; category:bug; feedback:The Claude AI reviewer is correct! The newly added safe read functionality is used for reading strings but not for reading bytes. It should be used for bytes too. Prevents unsafe bytes read. |
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! The newly added safe read functionality is used for reading strings but not for reading bytes. It should be used for bytes too. Prevents unsafe bytes read. |
3635: To review by AI